PROCEEDINGS 

OF SCIENCE 



o 

(N 
-i— > 
O 

O 



Oh! 



BQCD - Berlin quantum chromodynamics program 



Yoshifumi Nakamura 

Institutfiir Theoretische Physik, Universitat Regensburg, 93040 Regensburg, Germany 
Center for Computational Sciences, University ofTsukuba, Tsukuba, Ibaraki 305-8577, Japan 



E-mail: yoshi@ccs . tsukuba .ac.jp 



Hinnerk Stuben 

Konrad-Zuse-Zentrum fiir Informationstechnik Berlin, 14195 Berlin, Germany 



E-mail: stueben@zib.de 



> 

0\ 



We publish BQCD as free software under the GNU General Public License. BQCD is a Hybrid 
Monte-Carlo program that simulates lattice QCD with dynamical Wilson fermions. It is one of the 
main production programs of the QCDSF collaboration. The program can simulate 2 and 2 + 1 
fermion flavours with pure, clover improved, and stout smeared fat link Wilson fermions as well 
as standard plaquette, and an improved (rectangle) gauge action. The single flavour is simulated 
with the Rational Hybrid Monte-Carlo algorithm. 
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1. Introduction 

Berlin quantum chromodynamics program (BQCD) is a Hybrid Monte-Carlo [[lj] program for 
simulating lattice QCD with dynamical Wilson fermions. The development of BQCD started in 
1998 for the two flavour case and the standard Wilson action. It was written for a study of parallel 
tempering [^]. At that time the whole parallelisation framework was completed. Soon the program 
was extended in two different directions. The first direction was the implementation of clover 0(a) 
improvement of the fermion action. With the availability of clover improvement BQCD became 
one of the main production codes of the QCDSF collaboration pt]. The second direction was the 
addition of an external field to the standard Wilson action in order to study the Aoki phase [Q]. 
The next milestone was the implementation of the Hasenbusch trick Qj. Starting in 2006 the 
code has been largely extended to enable simulations including a third fermion flavour [Q,|8|, ||, |Tc| ]. 
This extension includes the implementation of Rational Hybrid Monte -Carlo (RHMC) [JTTJ] for the 
simulation of the third quark flavour as well as many algorithmic and performance improvements. 

The code is also being used by the DIK Collaboration for simulations at finite temperature 



[12, 13 1. Several people took BQCD as a starting point for adding their own code for measurements. 



The plan of the QPACE project [ ]14[ ] to port BQCD, in particular the fermion matrix multiplication 
and solvers [15], to their machine has triggered the publication of the code as free software under 
the GNU General Public License on the occasion of this Lattice conference. The source and a 
manual can be downloaded from [[If]]. A description on building and testing binaries can be found 
in the manual. 

2. Actions 

The program can simulate the QCD with the following actions. The gauge action can be the 
Wilson action 

5 = s Wilson = £ ^ReTra-f/praquette) (2.1) 
plaquette 

or a Symanzik improved gauge action 



CO £ TReTrl 1 "^plaquette) +Cl £ 3 Re Tr ( 1 ~ Rectangle ) 
plaquette rectangle 



(2.2) 



with Co + 8ci = 1. The fermion action can be the Wilson action 

5 Wiison = £ {y {x)v{x) _ K [ m ul( x - m + Yll )yr( x -fl) + vKx)C/ M (x)(l - 7m)V(*+ A)] } , 

(2.3) 

the Wilson action plus an explicitly parity-flavour symmetry breaking source term, where T 3 is the 
third Pauli matrix 

S F =S™ s ° n + h£y(x)i Y5 T 3 y(x), (2.4) 



the clover action 



S F = Sf lson - ;Ucs W £^(*)ct mv F M v(*Mjc) , (2.5) 
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the clover action plus a CP breaking term 

S F = Sj ilson - ^fccs W £^(x)a MV /> v (x)^(x) + dy{x)yMx) (2-6) 

or a stout smeared fat link action (any term containing gauge links can be smeared), in particular 
the SLiNC fermion action [|9|] 



Sf = X { Y(x) - K\jr{x)Ul (x - jft) [1 + ft] y(x - A) 

AT 

- k^t(jc)17 m (jc)[1 -y M ]y(* + A) + ^c sw ^(*)<y MV F MV (*)y(*)} , 



(2.7) 



where the gauge links Up are replaced by stout links Q17p 

£/ M ^f7 /i (x)=e i ' G "Wc/ M (x), (2.8) 

with 



2i 



V^xpHx) - U^x)V^x) - \tv (V M (*)Uj(*) - C/ M (x)yJ(x)) 



(2.9) 



3 

where (x) is the sum over all staples associated with the link. Boundary condition for the gauge 
field are periodic in all directions. For the fermions boundary conditions can be chosen to be 
anti-periodic or periodic for each dimension. 

3. Observables 

The following gluonic observables can be measured: the average plaquette and average rectan- 
gular plaquette, the topological charge (the topological charge is measured with the field theoretic 
method after cooling the gauge field configuration), the Polyakov loop. In addition some fermionic 
bulk quantities can be measured (from stochastic estimators): 

(fyy) = :(Tr(M -1 )) ('chiral condensate') 

(WYsY) = t^W^ 1 )) 

(n 2 ) = -^(Tr^M)- 1 } ('pionnorm') 

4. Algorithmic improvements 
4.1 Integrators 

HMC trajectories can be integrated with leapfrog or Omelyan Jl8| ] integrators. Multi timescale 
integration is possible with up to six time scales. In the following we explain a multi timescale setup 
that is used in production for Nf = 2+1 improved Wilson fermions. Starting point is the partition 
function 

Z = J DUDyDye s (4.1) 

S = S g (l5) + S l (K h c sw )+S s (K s ,c sw ) (4.2) 
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where S g is a gluon action, Si is an action for the degenerate u- and <i-quarks and S s is an action for 
the strange quark. After integrating out fermions 

S = S g Q3 ) - In [detMjMi] [detM s + M s ] a . (4.3) 

First even-odd preconditioning is applied 

detM/M/ - det(l + T l 00 f 6&tQ]Qi (4.4) 

[detMjM s ] s - det( 1 + Z£) [det Q+fiJ a (4.5) 

where 



= (1 + r) ee -M eo (l + T^Moe (4.6) 

T = 1 -c^ko^F^. (4.7) 
Then det Q]Qi is separated following Hasenbusch [j|] 

t t GlG/ 

det<2j<2/ = det W/W, det ^=r, W = <2 + p. (4.8) 

WiW t 

Finally the standard action is modified to 

S = S g + S det + S s det + Sfi + S l f 2 + Sf r , (4.9) 

where 

S' det = -2Trlog[l + r oo (jf')] (4-10) 

= -Trlog[l + r oo (K s )] (4.11) 

5}j = 1 t [W(K / ) t W(KO]- 1 0i (4-12) 

S^ 2 = <$>lW{K>)[Q{K>yQ{K l )]- l W{K l y<h (4.13) 

5 A = £ ti+M^Qi^T^'k+i (4.14) 
1=1 



We calculate Sf r using the RHMC algorithm [11] with optimised values for n and the number of 
fractions. Each term of the action is split into one ultraviolet and two infrared parts, 

Suv = S g (4.15) 
•Sir- i = + Sdet + s fi (4- 1 6) 

Sjr-2 = S l f2 + S s fr . (4.17) 

In the leap-frog integrator Suv> Sir-i an d Sir-2 are P ut on three separate time scales, 



(4.18) 



A = Vm-i ( P~) B m > V m -i (P-) (4.19) 
\2mi J \2ni\ J 

B = V uv (^—) Vq (—) Hjv (^—) (4-20) 

where n T = t/(5t) and the Vs are evolution operators of the Hamiltonian. 
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4.2 Solvers 

Besides the standard conjugate gradient (eg) solver BiCGstab and GMRES were implemented. 
Variants with mixed precision arithmetics are available for eg and BiCGstab. In order to reduce 



time spent in the solver chronological inversion J19Q is employed and even-odd preconditioning as 
well as Schwarz preconditioning p0| ] are used. 

5. Implementation details 

The code is mostly written Fortran. The C preprocessor is used for preprocessing in general 
and the m4 macro processor for a few files. A simple mechanism is employed to automatically gen- 
erate multi precision versions from the same source. BQCD is parallelised with MPI and OpenMP. 
The first version of the program was parallelised for a Cray T3E with the shmem library, shmem 
can still be used in the hopping matrix multiplication. 



Random numbers are generated with ranlux pi| , |22|]. Binary data (SU(3) configurations) can 
either be stored in a native BQCD format or in the International Lattice DataGrid (ILDG) [23] 
format. The input parameter file and the log file are simple text files that have a keyword value(s) 
structure. Important parts of the program are instrumented for time profiling and performance 
measurements. 



hopping matrix eg solver 

multiplication (Fortran) 







per core 


overall 


fraction 


per core 


overall 


fraction 


#racks 


#cores 


Mflop/s 


Tflop/s 


of peak 


Mflop/s 


Tflop/s 


of peak 


1/2 


2048 


344 


0.70 


10.1 % 


385 


0.79 


11.3% 


1 


4096 


429 


1.76 


12.6 % 


461 


1.89 


13.6% 


2 


8192 


415 


3.40 


12.2 % 


444 


3.64 


13.1 % 


4 


16384 


407 


6.67 


12.0% 


423 


6.93 


12.4 % 



Table 1: Performance figures for a 48 3 x 96 lattice obtained with the pure Fortran implementation on a 
BlueGene/P. 



hopping matrix eg solver 

multiplication (assembler) 







per core 


overall 


fraction 


per core 


overall 


fraction 


#racks 


#cores 


Mflop/s 


Tflop/s 


of peak 


Mflop/s 


Tflop/s 


of peak 


1/2 


2048 


1057 


2.16 


31.1% 


821 


1.68 


24.1 % 


1 


4096 


1061 


4.35 


31.2% 


802 


3.28 


23.6 % 


2 


8192 


1019 


8.35 


30.0% 


763 


6.25 


22.5 % 


4 


16384 


923 


15.11 


27.1% 


684 


11.21 


20.1 % 



Table 2: Performance figures for a 48 3 x 96 lattice obtained with an assembler implementation of the hop- 
ping matrix multiplication on a Blue Gene/P. 
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48 A 3x96 lattice 
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Figure 1: Scaling plot of performance data from a BlueGene/P given in Tables 1 and 2. The dotted line 
indicates linear scaling. Any linear scaling runs parallel to this line. 



6. Performance 

The program scales very well to large numbers of cores. The pure Fortran case even displays 
some super-linear speedup (see Figure 1). For Blue Gene and Itanium2 assembler implementations 
of the hopping matrix multiplication were provided by Th. Streuer. Performance figures for a 48 3 x 
96 lattice obtained on a Blue Gene/P are given in Tables 1 and 2. The assembler implementations 
makes it possible to overlap communication with computation. This boosts the performance of the 
hopping multiplication of up to a factor of 3.1 and the whole conjugate gradient solver by a factor 
of 1.6 to 2.1 compared with the pure Fortran version. With this code it is possible to run simulations 
at a sustained overall speed of 1 1 .2 Tflop/s 
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